 research design


Q-Sat AI: Machine Learning-Based Decision Support for Data Saturation in Qualitative Studies

arXiv.org Artificial Intelligence

The determination of sample size in qualitative research has traditionally relied on the subjective and often ambiguous principle of data saturation, which can lead to inconsistencies and threaten methodological rigor. This study introduces a new, systematic model based on machine learning (ML) to make this process more objective. Utilizing a dataset derived from five fundamental qualitative research approaches - namely, Case Study, Grounded Theory, Phenomenology, Narrative Research, and Ethnographic Research - we developed an ensemble learning model. Ten critical parameters, including research scope, information power, and researcher competence, were evaluated using an ordinal scale and used as input features. After thorough preprocessing and outlier removal, multiple ML algorithms were trained and compared. The K-Nearest Neighbors (KNN), Gradient Boosting (GB), Random Forest (RF), XGBoost, and Decision Tree (DT) algorithms showed the highest explanatory power (Test R2 ~ 0.85), effectively modeling the complex, non-linear relationships involved in qualitative sampling decisions. Feature importance analysis confirmed the vital roles of research design type and information power, providing quantitative validation of key theoretical assumptions in qualitative methodology. The study concludes by proposing a conceptual framework for a web-based computational application designed to serve as a decision support system for qualitative researchers, journal reviewers, and thesis advisors. This model represents a significant step toward standardizing sample size justification, enhancing transparency, and strengthening the epistemological foundation of qualitative inquiry through evidence-based, systematic decision-making.


Generative Artificial Intelligence and Agents in Research and Teaching

arXiv.org Artificial Intelligence

This study provides a comprehensive analysis of the development, functioning, and application of generative artificial intelligence (GenAI) and large language models (LLMs), with an emphasis on their implications for research and education. It traces the conceptual evolution from artificial intelligence (AI) through machine learning (ML) and deep learning (DL) to transformer architectures, which constitute the foundation of contemporary generative systems. Technical aspects, including prompting strategies, word embeddings, and probabilistic sampling methods (temperature, top-k, and top-p), are examined alongside the emergence of autonomous agents. These elements are considered in relation to both the opportunities they create and the limitations and risks they entail. The work critically evaluates the integration of GenAI across the research process, from ideation and literature review to research design, data collection, analysis, interpretation, and dissemination. While particular attention is given to geographical research, the discussion extends to wider academic contexts. A parallel strand addresses the pedagogical applications of GenAI, encompassing course and lesson design, teaching delivery, assessment, and feedback, with geography education serving as a case example. Central to the analysis are the ethical, social, and environmental challenges posed by GenAI. Issues of bias, intellectual property, governance, and accountability are assessed, alongside the ecological footprint of LLMs and emerging technological strategies for mitigation. The concluding section considers near- and long-term futures of GenAI, including scenarios of sustained adoption, regulation, and potential decline. By situating GenAI within both scholarly practice and educational contexts, the study contributes to critical debates on its transformative potential and societal responsibilities.
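The probabilistic sampling methods named in this abstract (temperature, top-k, and top-p) can be illustrated with a small, self-contained sketch. It is not taken from the paper: the function name `sample_token`, its parameters, and the toy logits are assumptions made for illustration, and real LLM libraries implement these steps in their own ways.

```python
import math
import random


def sample_token(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    """Pick one token index from raw scores (hypothetical helper, not a library API)."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()

    # Temperature: divide logits before the softmax. Values below 1 sharpen
    # the distribution, values above 1 flatten it.
    scaled = [value / temperature for value in logits]
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]  # subtract max for stability
    total = sum(exps)
    probs = [e / total for e in exps]

    # Rank candidate indices from most to least probable.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep only the k most probable tokens.
    if top_k is not None:
        order = order[:top_k]

    # Top-p (nucleus): keep the smallest prefix whose cumulative
    # probability reaches top_p.
    if top_p is not None:
        kept, cumulative = [], 0.0
        for i in order:
            kept.append(i)
            cumulative += probs[i]
            if cumulative >= top_p:
                break
        order = kept

    # random.choices renormalises the remaining weights implicitly.
    weights = [probs[i] for i in order]
    return rng.choices(order, weights=weights, k=1)[0]


if __name__ == "__main__":
    toy_logits = [2.0, 1.0, 0.1]  # made-up scores for three tokens
    print(sample_token(toy_logits, temperature=0.7, top_k=2, rng=random.Random(0)))
```

With `top_k=1` the sketch reduces to greedy decoding, since only the single most probable token survives the filter.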


Computationally Intensive Research: Advancing a Role for Secondary Analysis of Qualitative Data

arXiv.org Artificial Intelligence

This paper draws attention to the potential of computational methods in reworking data generated in past qualitative studies. While qualitative inquiries often produce rich data through rigorous and resource-intensive processes, much of this data usually remains unused. In this paper, we first make a general case for secondary analysis of qualitative data by discussing its benefits, distinctions, and epistemological aspects. We then argue for opportunities with computationally intensive secondary analysis, highlighting the possibility of drawing on data assemblages spanning multiple contexts and timeframes to address cross-contextual and longitudinal research phenomena and questions. We propose a scheme to perform computationally intensive secondary analysis and advance ideas on how this approach can help facilitate the development of innovative research designs. Finally, we enumerate some key challenges and ongoing concerns associated with qualitative data sharing and reuse.


Science in the Era of ChatGPT, Large Language Models and Generative AI: Challenges for Research Ethics and How to Respond

arXiv.org Artificial Intelligence

Since the release of popular large language models (LLMs) such as ChatGPT, the transformative impact of artificial intelligence (AI) on broader society has been unprecedented. This is particularly alarming for science and its conquest of truth (Chomsky et al., 2023). Generative AI and, particularly, conversational AI based on language models have raised new ethical dilemmas for knowledge, epistemology and research practice. From authorship to misinformation, biases, fairness and the safety of interactions with human subjects, research ethics boards need to adapt to this new era in order to protect research integrity and set high-quality ethical standards for research conduct (van Dis et al., 2023). This paper focuses on reviewing these challenges with the aim of laying foundations for a timely and effective response. ChatGPT is an AI chatbot released in November 2022 by OpenAI. It is a Generative Pre-trained Transformer (GPT), a type of artificial deep neural network with a number of parameters in the order of billions. It is designed to process sequential input data, i.e. natural language, without labeling (self-supervised learning), but with remarkable capabilities for parallelization that significantly reduce training time. The model is further enhanced by a combination of supervised and reinforcement learning based on past conversations as well as human feedback to fine-tune the model and its responses (Stiennon et al., 2020; Gao,


The different tribes of data scientists

#artificialintelligence

Something that many employers are not aware of is that there is more than one path to becoming a data scientist. Data science, after all, is an umbrella term that encapsulates other fields, such as AI, machine learning and statistics. Since the rise in popularity of data science, many people try to promote themselves as data scientists, whereas in the past they might have used a different term. Also, many people are trying to get the right education. Understanding someone's background better can also help you make better hiring and management decisions.


No More "What" Without the "Why"

#artificialintelligence

Over the last few months, I had the chance to help various organizations and leaders leverage their large databases with machine learning. I worked in particular with member organisations that struggle with rising dropout rates (churn) -- an issue that became even more serious during the pandemic, when individual income was declining and the fear of job loss was rising. With machine learning, we used very large membership databases with individual-level information (e.g. ...). Machine Learning tells us the "What", Causal Inference the "Why". Despite the overall good performance of the machine learning models, our clients were always interested in one obvious question: Why does an individual member leave? Unfortunately, machine learning models are not suited to identifying the causes of things; rather, they are built to predict things.


How Large a Sample Do You Need?

#artificialintelligence

And is sample size even the right question? I regularly conduct both qualitative and quantitative research. Regardless of qual or quant, I'm often asked about sample size. An economist might balk at the idea that you can get value from 5–10 one-hour interviews. Yet there is a lot of value in qualitative research that can't be achieved with quant. And quantitative research has its own perils -- including sample size issues. Questions about sample size are more complex than they appear. A proper answer requires nuance. It depends on the theoretical justification for the results, the effect size observed, the number of hypotheses, and more.


Accounting for Unobservable Heterogeneity in Cross Section Using Spatial First Differences

arXiv.org Machine Learning

We propose a simple cross-sectional research design to identify causal effects that is robust to unobservable heterogeneity. When many observational units are adjacent, it may be sufficient to regress the "spatial first differences" (SFD) of the outcome on the treatment and omit all covariates. This approach is conceptually similar to first differencing approaches in time-series or panel models, except the index for time is replaced with an index for locations in space. The SFD approach identifies plausibly causal effects so long as local changes in the treatment and unobservable confounders are not systematically correlated between immediately adjacent neighbors. We illustrate how this approach can mitigate omitted variables bias through simulation and by estimating returns to schooling along 10th Avenue in New York and I-90 in Chicago. We then more fully explore the benefits of this approach by estimating effects of climate and soil on maize yields across US counties. In each case, we demonstrate the performance of the research design by withholding important covariates during estimation. SFD has multiple appealing features: it offers internal robustness checks that exploit rotation of the coordinate system or double-differencing across space, it is immediately applicable to spatially gridded data sets, and it can be easily implemented in statistical packages by replacing a single index in pre-existing time-series functions.
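The core idea of this abstract, regressing differences between adjacent units instead of levels, can be sketched on simulated data. Everything below is an illustrative assumption and not the authors' code or data: units are placed on a single line, the unobserved confounder is a smooth sine curve, the true effect is set to 2.0, and the helper names (`ols_slope`, `first_diff`, `simulate`) are hypothetical.

```python
import math
import random


def ols_slope(x, y):
    """Slope of a simple OLS regression of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def first_diff(values):
    """Difference between each unit and its immediately preceding neighbour."""
    return [values[i] - values[i - 1] for i in range(1, len(values))]


def simulate(n=2000, beta=2.0, seed=0):
    """Return (levels_estimate, sfd_estimate) for one simulated line of units."""
    rng = random.Random(seed)
    # Unobserved confounder: varies smoothly over space, so adjacent
    # neighbours have nearly the same value.
    confounder = [5.0 * math.sin(i / 200.0) for i in range(n)]
    # Treatment is correlated with the confounder in levels, but its local
    # (neighbour-to-neighbour) variation is dominated by independent noise.
    treatment = [u + rng.gauss(0.0, 1.0) for u in confounder]
    outcome = [beta * x + u + rng.gauss(0.0, 0.1)
               for x, u in zip(treatment, confounder)]

    # Levels regression omits the confounder and is biased upward.
    levels_estimate = ols_slope(treatment, outcome)
    # SFD regression: difference both variables across neighbours, then regress.
    sfd_estimate = ols_slope(first_diff(treatment), first_diff(outcome))
    return levels_estimate, sfd_estimate


if __name__ == "__main__":
    levels, sfd = simulate()
    print(f"true effect 2.0 | levels OLS {levels:.3f} | SFD OLS {sfd:.3f}")
```

In this setup the confounder barely changes between neighbours, so differencing removes most of it while the treatment's local noise remains; that matches the condition stated in the abstract that local changes in treatment and confounders must not be systematically correlated.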


Data Science Research & Development - Internship - Civis Analytics

#artificialintelligence

Are you passionate about model strategy and research design? Do you want to learn from data scientists and have an immediate impact on our work? Civis Analytics is looking for a Data Science Research and Development intern to join our team! Civis Analytics was born on the campaign trail, with CEO Dan Wagner and our founding members spearheading the 2012 Obama for America analytics team. Since then, our DC and Chicago teams have been building software and growing rapidly among a steadily developing client base in education, energy, government, healthcare, media, nonprofits, and politics.


Data science through the lens of research design

#artificialintelligence

You are a data scientist or engaged in a data science project in your organization. You have one of the most interesting, influential, and intellectually stimulating jobs on the market. You've mastered stats and machine learning, and become a programming wizard, an expert in visualization, a big data evangelist, and a math god. Over the last three years, our group has led numerous data science projects across diverse verticals, including ad tech, fin tech, health tech, cloud computing, security, and the telecom industry. Surprisingly, many of our projects share similar attributes despite originating from different domains.